Please fill out the anonymous electronic course evaluation. Feel free to leave your feedback; the course can always improve thanks to students’ input!
12/03/2020
Consider a simple linear model \[Y = \beta_{0} + X_{1} \beta_{1} + \dots + X_{p} \beta_{p} + \varepsilon\]
The statement above defines a sampling model for the observed data, given a set of unknown parameters \(\boldsymbol{\theta} = (\beta_{0}, \beta_{1}, \dots, \beta_{p}, \sigma^{2})\), i.e.
\[Y_{i} \mid \boldsymbol{\theta} \sim N(\beta_{0} + x_{i1} \beta_{1} + \dots + x_{ip} \beta_{p}, \sigma^{2})\]
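For concreteness, here is a minimal Python sketch simulating data from this sampling model (the sample size, coefficients, and noise level are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) values: n observations, p predictors
n, p = 100, 2
beta0 = 1.0
beta = np.array([2.0, -0.5])      # hypothetical regression coefficients
sigma = 1.0                        # noise standard deviation

X = rng.normal(size=(n, p))        # design matrix
# Sampling model: Y_i | theta ~ N(beta0 + x_i' beta, sigma^2)
Y = beta0 + X @ beta + rng.normal(scale=sigma, size=n)
```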
Our inferential goal is often that of obtaining point estimators \(\hat{\boldsymbol{\theta}}\), interval estimators \((\hat{\boldsymbol{\theta}}_{L}, \hat{\boldsymbol{\theta}}_{U})\) and test hypotheses about \(\boldsymbol{\theta}\).
Classical (frequentist) statistical inference is based on the assumption that \(\boldsymbol{\theta}\) is fixed, and \(Y\) is random (even after being observed).
A point estimator \(\hat{\boldsymbol{\theta}} = \hat{\boldsymbol{\theta}}(Y)\) is a summary of the data.
Under the assumption of repeated sampling (i.e., if we could observe a sequence of new datasets), \(\hat{\boldsymbol{\theta}}\) is random and has a sampling distribution.
For example, if the parameter of interest is \(\theta = \text{average height}\), we use \(\hat{\theta} = \bar{x}\), whose sampling distribution is (approximately) \(\hat{\theta} \sim N(\theta, \sigma^{2} / n)\).
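A short simulation illustrates this repeated-sampling idea (the true mean, standard deviation, and sample size below are made-up values):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up "true" values for illustration
theta, sigma, n = 170.0, 10.0, 50   # true average height, sd, sample size

# Repeated sampling: many hypothetical datasets, each summarized by x_bar
theta_hats = rng.normal(theta, sigma, size=(10_000, n)).mean(axis=1)

print(theta_hats.mean())   # close to theta
print(theta_hats.std())    # close to sigma / sqrt(n)
```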
Possible scientific question: what is the probability that \(\theta\) is greater than 0.5, given the observed data?
Note: this seemingly innocuous question cannot be answered with the tools of classical statistics.
The classical shortcut to this question is to set up a testing scenario \(H_{0}: \theta \leq 0.5\), \(H_{1}: \theta > 0.5\).
Consider a Z-test:
\[Z = \frac{\hat{\theta} - 0.5}{\sqrt{0.5 (1 - 0.5)/n}}\]
With p-value = \(P(Z > z) = 1 - \Phi(z)\), the answer is now binary: accept or reject \(H_{0}\).
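As a quick illustration of the computation (the estimate \(\hat{\theta} = 0.56\) and sample size \(n = 100\) are hypothetical values):

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical data: estimated proportion theta_hat from n observations
theta_hat, n = 0.56, 100

# Z statistic with the standard error evaluated under H0: theta = 0.5
z = (theta_hat - 0.5) / sqrt(0.5 * (1 - 0.5) / n)
p_value = 1 - norm.cdf(z)   # P(Z > z) = 1 - Phi(z)
print(z, p_value)
```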
Problems:
Why use Bayesian methods?
Why do people use classical methods?
Many of the reasons why classical methods are more common than Bayesian methods are historical:
Frequentist setting:
Bayesian setting:
Suppose we observe \(y\), the number of infected individuals in a sample of size \(n\), and we want to estimate the prevalence \(\theta\) of a disease. A reasonable sampling model is
\[y \sim \text{Binomial}(n, \theta)\]
\[p(y \mid \theta) = {n\choose y} \theta^{y} (1 - \theta)^{n - y}\]
A reasonable prior for \(\theta\) could belong to the Beta family, so that
\[\theta \sim \text{Beta}(a, b)\]
\[p(\theta) = C\, \theta^{a-1} (1 - \theta)^{b-1}\]
It turns out that this prior is conjugate to the Binomial likelihood.
A flexible family of distributions representing prior understanding about a proportion.
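To get a feel for this flexibility, here is a small sketch evaluating a few arbitrary members of the Beta family (the \((a, b)\) choices below are illustrative, not recommendations):

```python
import numpy as np
from scipy.stats import beta

theta_grid = np.linspace(0.01, 0.99, 5)

# A few arbitrary (a, b) choices and the prior beliefs they encode
for a, b in [(1, 1), (2, 2), (0.5, 0.5), (2, 8)]:
    print(f"Beta({a},{b}): prior mean = {a / (a + b):.2f}, "
          f"density on grid = {np.round(beta.pdf(theta_grid, a, b), 2)}")
```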
The posterior distribution is simply
\[p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta) \propto \theta^{y} (1 - \theta)^{n-y}\, \theta^{a-1} (1 - \theta)^{b-1} = \theta^{a+y-1} (1 - \theta)^{b+n-y-1},\]
which we recognize as the kernel of a Beta distribution:
\[\theta \mid y \sim \text{Beta}(a + y, b + n - y)\]
Suppose we see 1 infection among 20 patients and assume relative prior ignorance, so that \(\theta \sim \text{Beta}(1, 1)\); the posterior is then \(\theta \mid y \sim \text{Beta}(2, 20)\).
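A quick check of this conjugate update in Python (using scipy; the summaries follow directly from the Beta(2, 20) posterior):

```python
from scipy.stats import beta

a, b = 1, 1    # Beta(1, 1) prior: relative prior ignorance
y, n = 1, 20   # 1 infection among 20 patients

# Conjugate update: theta | y ~ Beta(a + y, b + n - y) = Beta(2, 20)
posterior = beta(a + y, b + n - y)

print(posterior.mean())        # posterior mean = 2 / 22 ≈ 0.09
print(1 - posterior.cdf(0.5))  # posterior probability that theta > 0.5
```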
Frequentist Intervals (Confidence Intervals)
Bayesian Intervals (Credible Intervals)
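For the running example (\(y = 1\), \(n = 20\)), the sketch below contrasts one common frequentist choice (the approximate Wald interval) with the equal-tailed credible interval from the Beta(2, 20) posterior:

```python
from math import sqrt
from scipy.stats import beta

y, n = 1, 20
theta_hat = y / n

# Frequentist: approximate 95% Wald confidence interval (one common choice)
se = sqrt(theta_hat * (1 - theta_hat) / n)
wald_ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)

# Bayesian: 95% equal-tailed credible interval from the Beta(2, 20) posterior
cred_int = beta(2, 20).interval(0.95)

print(wald_ci)   # roughly (-0.05, 0.15): can fall outside [0, 1] for small n
print(cred_int)  # roughly (0.01, 0.24): always respects the parameter space
```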
When the number of predictors \(p\) is large, it is often necessary to regularize the likelihood by defining penalties on the regression coefficients \(\boldsymbol{\beta}\). Penalized log likelihood functions take the form
\[\ell_{\lambda} (y; \boldsymbol{\beta}, \sigma^{2}) = \sum_{i=1}^{n} \log p(y_{i} \mid \boldsymbol{\beta}, \sigma^{2}) - g_{\lambda}(\boldsymbol{\beta})\]
where \(g_{\lambda}(\boldsymbol{\beta})\) is a penalty function and \(\lambda\) is a tunable scale parameter.
Examples:
Frequentist inference usually treats \(\lambda\) as a nuisance parameter and estimates \(\boldsymbol{\beta}\) given a value of \(\lambda\). The “optimal” \(\lambda\) is then selected via cross-validation.
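As a sketch of this workflow using scikit-learn (where the penalty scale \(\lambda\) is called alpha), with simulated data chosen purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)

# Simulated illustrative data: only the first 3 of 50 predictors matter
n, p = 200, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.normal(size=n)

# Treat the penalty scale (lambda, called alpha in scikit-learn) as a
# tuning parameter and select it by 5-fold cross-validation
model = LassoCV(cv=5).fit(X, y)
print(model.alpha_)               # selected penalty
print(np.sum(model.coef_ != 0))   # number of nonzero coefficients
```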
We notice that
\[L_{\lambda} (y; \boldsymbol{\beta}, \sigma^{2}) = \prod_{i=1}^{n} p(y_{i} \mid \boldsymbol{\beta}, \sigma^{2}) \times e^{- g_{\lambda}(\boldsymbol{\beta})}\]
which is readily interpreted as
\[p(Y \mid \boldsymbol{\beta}, \sigma^{2}) p(\boldsymbol{\beta} \mid \lambda)\]
where \(p(\boldsymbol{\beta} \mid \lambda) \propto e^{- g_{\lambda}(\boldsymbol{\beta})}\) is the prior.
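For example, a quadratic (ridge-type) penalty \(g_{\lambda}(\boldsymbol{\beta}) = \lambda \sum_{j} \beta_{j}^{2}\) corresponds to independent Gaussian priors on the coefficients:
\[p(\boldsymbol{\beta} \mid \lambda) \propto e^{-\lambda \sum_{j} \beta_{j}^{2}} = \prod_{j} e^{-\lambda \beta_{j}^{2}}, \qquad \text{i.e. } \beta_{j} \mid \lambda \sim N\!\left(0, \tfrac{1}{2\lambda}\right).\]
Similarly, an absolute-value (lasso-type) penalty corresponds to independent Laplace priors.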
Bayesian methods naturally shrink the parameter estimates thanks to the information incorporated through the prior! They tend to be more robust and less prone to overfitting.